Chromosome-scale genome assembly of an important medicinal plant honeysuckle

Lonicera japonica (honeysuckle) is one of the most important medicinal plants and widely utilized in traditional Chinese medicine. At present, there are many varieties of honeysuckle used in cultivation, among which Sijihua variety are widely cultivated due to its wide adaptability, stress resistance, early flowering and high yield. In this study, we assembled the genome of Sijihua, which was approximately 886.04 Mb in size with a scaffold N50 of 79.5 Mb. 93.28% of the total assembled sequences were anchored to 9 pseudo-chromosomes by using PacBio long reads and Hi-C sequencing data. We predicted 39,320 protein-coding genes and 92.87% of them could be annotated in NR, GO, KOG, KEGG and other databases. In addition, we identified 644 tRNAs, 2,156 rRNAs, 109 miRNAs and 5,502 pseudogenes from the genome. The chromosome-scale genome of Sijihua will be a significant resource for understanding the genetic basis of high stress-resistance, which will facilitate further study of the genetic diversity and accelerate the genetic improvement and breeding of L. japonica.

trait alternations between different varieties. Therefore, we also sequenced and de novo assembled genome of a L. japonica variety that is called 'Sijihua' , which is largely planted in Pingyi county, Shandong province. It's reported that Sijihua are the most cold-resistant variety by studying different varieties of L. japonica in different regions 10 .
Here, we generated a chromosome-scale of the genome assembly of the variety of Sijihua using the combination of PacBio long reads, Illumina reads and the Hi-C sequencing data. Approximately 886.04 Mb genome was assembled with the contig N50 length of 1.58 Mb. A total of 826.50 Mb (93.28%) of the assembled sequences were anchored to 9 pseudo-chromosomes (Table 1). We predicted 39,320 protein-coding genes, and 92.87% of each gene were assigned by BLASTP against NR, GO, KOG, KEGG and other databases. We identified 644 tRNAs, 2,156 rRNAs, 109 miRNAs and 5,502 pseudogenes ( Table 2). We also identified 255,264 simple sequence repeats (SSRs), and 40,252 are polymorphic SSRs ( Table 3). The genome assembly of variety Sijihua is a valuable material to the germplasm resources of L. japonica, and helps researchers to explore the specificity of different varieties. The results also provide valuable clues to the molecular basis of cold-resistance traits of Sijihua and will facilitate further genetic improvements.

Methods
Sample collection, library construction and genome size estimation. High-quality genomic DNA was extracted from young-fresh leaf tissue of Sijihua using CTAB (cetyl trimethylammonium bromide) method, and the samples were collected from the Zhongke Honeysuckle Planting Cooperative in Pingyi County, Shandong, China (Fig. 1). The qualified genomic DNA was broken to the target fragment (350 bp) by ultrasonic

Assembly Sijihua Lj10107428
Genome-sequencing depth (X) www.nature.com/scientificdata www.nature.com/scientificdata/ shock and the Illumina library was constructed through end repairing, adding 3' A tail, ligating adapters and enriching with PCR. The fragment size and quality of the library were detected by 2100 and Q-PCR. Next, the sequencing of the library were performed by using Illumina Novaseq 6000, which finally generated 31.48 million reads, 50.26 Gb of raw data, which covered 61.48 × of the genome. PacBio library were constructed by using BulePippin to screen the target fragment that was interrupted by g-tube, subsequently was sequenced with PacBio Sequel II system. Consequently, the two SMRT-cells generated a total of 4,461,375 reads with N50 size of 29,096 bp. Totally we got 87.61 Gb sequencing data, accounting for 98.88 × of the entire genome. Fresh leaf tissue of honeysuckle was used to construct a library for Hi-C analysis. The fresh tissue was fixed with formaldehyde, attaining interacting loci to be bound to one another, and then cross-linked DNA was digested by restriction enzyme Hind III. Sticky ends were labeled with biotin during its repairing. Next, the interacting DNA fragments were ligated, purified and finally broke them into 300 bp ~ 700 bp fragments. Each ligated DNA fragment was marked with biotin and streptavidin beads were used to pull-down the interacting DNA fragments to complete the Hi-C library construction. The libraries were then sequenced on Illumina Novaseq 6000 platform, which generated 307,357,239 pairs of reads and 91.84 Gb clean data, which cover 103.65x of the genome.
A k-mer (k = 19) analysis was constructed using 61.48 × Illumina data to estimate the genome size, proportion of repeat sequence and heterozygosity 11 . From the 19-kmers distribution, we could estimate the heterozygosity and repeat ratio of the Sijihua genome to be 0.74% and 51.8%, respectively and the estimated genome size was 817.45 Mb ( Fig. 2 and Table 1).  Table 3. SSRs annotation of Sijihua and Lj10107428 genomes. De novo genome assembly. PacBio long reads were error corrected using Canu v1.5 16 and the top 40 x coverage of the longest corrected reads were subsequently assembled by SMARTdenovo 17 . To improve the accuracy of the assembly, the 61.48x Illumina short reads used in the genome survey were applied to three rounds of correction by Pilon v1.22 18 . The high-quality Hi-C reads were used to cluster, order and orient the contigs onto pseudo-chromosomes by using the LACHESIS 19 . We preliminary assembled the PacBio long reads into contig sequences of 886.04 Mb, including 1,159 contigs with N50 of 1.58 Mb, and the longest contig is 12.45 Mb. These contigs were further anchored onto 9 pseudo-chromosomes, accounting for 93.28% of the assembled genome. The final chromosome-scale genome assembly of Sijihua was 886.13 Mb with a scaffold N50 of 79.57 Mb (Table 1).

Repeat annotation.
De novo and structure-based predictions were integrated to annotate repetitive sequences. LTR_FINDER v1.05 20 and RepeatScout v1.05 21 were primarily used to build a de novo repeat sequences library of Sijihua genome, which was classified by using PASTEClassifier v1.0 22 and merged with Repbase v19.06 23 database as the final repeat sequences database. Structure-based predictions were performed by using RepeatMasker v4.05 24 based on the constructed repeat sequences database. We identified 573.89 Mb (64.76%) of repetitive sequences in Sijihua genome. Most of these repeat sequences are Class I (53.57%) retrotransposons, including Copia, Gypsy, LINE and SINE, accounted for 15.93%, 19.14%, 2.61% and 0.34% of the entire genome, respectively. In addition, Class II DNA transposons make up 5.82% of the genome ( Table 2).  Table 2). For non-coding RNAs annotation, microRNA and rRNA were detected by aligning the assembled genome against the to Rfam 37 database using BLASTN. tRNA was identified by tRNAscan-SE 38 . Finally, we totally identified 2,909 non-coding RNAs, including 109 miRNAs, 2,156 rRNAs and 644 tRNAs.

Protein-coding genes prediction and other annotations of the genome. Prediction
The sequence of pseudogenes is similar to that of functional genes, but whose original function is lost due to insertions, deletions and other variants. The predicted protein sequences were used to search for homologous gene sequences on the genome through BLAT 39 alignment, and then GeneWise 39 was used to search for immature stop codons and frameshift mutations in gene sequences. In total, 5,502 pseudogenes were predicted ( Table 2). www.nature.com/scientificdata www.nature.com/scientificdata/ For gene functional annotation, we aligned the predicted protein-coding gene sequences against public functional databases using BLAST 40 54 , SUPERFAMILY 55 , CATH-Gene3D 56 and PANTHER 57 . As a result, more than 92% of protein-coding genes were annotated, and 1,376 conserved motifs and 36,282 domains were identified.

Global genome comparison of the Lj10107428 and Sijihua. Genome comparison between Sijihua
and Lj10107428 was performed by using the NUCmer program embedded in MUMmer4 with parameters -mum -l 40 -c 100, then the delta alignment file was filtered by delta-filter with parameters -1, and finally show-snps was used to identify single nucleotide polymorphisms (SNPs) with parameters -ClrT. SyRI v1.5 59 was used to extract structure variations (SVs) based on the alignment file with tab-delimited text format performed by show-coords program. MCScanX 60 was used to identify gene collinearity, including within and between genomes. In total, we identified 2,996,015 SNPs between Sijihua and Lj10107428. Among them, 14.44% of the SNPs were located in the genic region. Comparison between Lj10107428 and Sijihua genomes, we found 5,150 small insertions/ deletions (indels, length shorter than 500 bp), and more than 4.5 Mb of presence/absence variation (PAV, length longer than 500 bp). Notably, we identified 895 Sijihua-specific genomic sequences (2.61 Mb in total), and 418 Lj10107428-specific genomic sequence (1.84 Mb in total) longer than 500 bp. These PAV segments were unevenly distributed across the chromosomes, and the longest PAV sequence was a 261 Kb segment on chromosome 2 (Fig. 3a). In addition, we also found 156 inversions ( >1,000 bp) and several translocations ( >10 kb) between these two genomes. The genome sequence comparison between Lj10107428 and Sijihua reveals high collinearity. Although some structure variations were detected between these two genomes, they primarily consist of large syntenic block with high degrees of collinearity (Fig. 3b). In addition, we identified 301 large syntenic blocks between Sijihua and Lj10107428, containing 25,128 syntenic genes. Specificity, there is an inverted region of approximate 19 Mb on chromosome 8 between Sijihua and Lj10107428 genomes (Fig. 3b). We also noticed that 561 and 302 NBS-LRR genes located in to Sijihua and Lj10107428 genomes, respectively (Fig. 3b). By comparing the annotated protein-coding genes, we found that 26,937 genes of Sijihua and Lj10107428 were shared by reciprocal best hit of BLAST algorithm with parameter (E-value < 1e-10). More species-specific genes were found in Sijihua (12,383) than that in Lj10107428 (6,361) (Fig. 3c).
From the 18,958 overlapped expressed (FPKM > 0.5) genes between Sijihua and Lj10107428, most of them (>93.4%) did not show differential expression variation at juvenile bud stage. However, hundreds of genes were differential expressed, including 16 NBS-LRR genes. Notably, 11 of 16 NBS-LRR genes were highly expressed in Sijihua at juvenile bud stage. This observation is consistent across other flower development stages (Fig. 3d).

Data Records
The raw data of PacBio, Illumina and Hi-C sequencing were submitted to the National Center for Biotechnology Information (NCBI) SRA database with accession number SRP353698 61 under BioProject accession number PRJNA794868. RNA-seq data were deposited into the NCBI (accession number PRJNA813701) 62 . The assembled genome had been deposited at GenBank with accession number SAMN24662184 63 . In addition, the genome annotation file had been submitted at the Figshare 64 . www.nature.com/scientificdata www.nature.com/scientificdata/ technical Validation Evaluation of the genome assembly. To evaluate the quality of genome assembly, bwa (version: 0.7.10-R789; mode: aln) was used to align the Illumine short reads with the reference genome, and 99.75% of the Illumina short reads were mapped to the reference genome CEGMA 65 v2.5 was used to assess the integrity of the final genome assembly. The CEGMA database contained 458 conserved core eukaryotic genes, while our assembled genome contained 439 (95.85%), which suggested that our assembled genome contains most of the core eukaryotic genes. BUSCO 66 v4.0 was used to assess the integrity of our genome assembly by using the Embryophyta database of OrthoDB v10. Of the 1614 expected embryophyta genes, our genome contains 1,556 (97.03%). Together, these three evaluation systems demonstrate the high integrity of our assembled genome (Table 1).
Furthermore, to assess the result of Hi-C assembly, and the number of Hi-C read pairs coverage between any two bins acts as a strength signal of interaction between the two bins. As chromosomal interaction heatmap shown, within each group, it was found that the intensity of interaction at the diagonal position was higher than that at the non-diagonal position (Fig. 4), which was consistent with the principle of Hi-C-assisted genome assembly and proved that the genome assembly was accurate.